Closes #448 | Add/Update Dataloader alorese #541
Conversation
I was only able to test the `t2t` subset (my PC's almost out of storage 😅), and it works. I'm OK with the implementation, just have a few comments. Can you please run `make check_file=...` just so that the formatting is consistent everywhere?
@@ -0,0 +1,710 @@
_URLS_DICT = {
Just curious: how were you able to generate this dictionary? Do you think it's possible to automate this process instead of keeping this file?
Hi @ljvmiranda921! Based on the archive, here are the steps:
1. I scraped through the pagination to grab all of the blob URLs that lead to the detail pages.
2. From each detail page, I scraped all of the `.wav` blob URLs like this.
3. Now for the hard part: to match each `.wav` with its `.xml` (caption), I scraped this blob and picked the most similar file name for each corresponding `.wav` filename (a rough sketch of this matching is at the end of this comment).
4. Sadly, the names do not always follow the same naming as the `.wav` files, so I had to recheck them one by one. Not to mention there might be `.wav` files that do not have any `.xml` at all.

Because of step 4, I decided not to include my scraping script in the code, since it would:
- break on any slight UI change, and
- be slow: pagination scraping of ~200 items, not including the detail page scraping plus the captions, so the quickest run is roughly ~200 * 3 = 600 URLs. Not to mention the time needed varies with the user's network speed.

Furthermore, I had gone through the discussion on Discord before implementing `alorese`, because this one is a little bit tricky. I got approval for such a "bulky" method for the sake of a better result. I can't open my Discord right now; I will paste the discussion thread later on.
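For illustration, here is a minimal sketch of the kind of closest-name matching used in step 3, using only the standard library. The filenames below are hypothetical, and this is not the actual scraping script:

```python
from difflib import get_close_matches

wav_names = ["alorese_story_01.wav", "alorese_song_02.wav"]          # hypothetical
xml_names = ["alorese_story_01.xml", "alorese_song_2_caption.xml"]   # hypothetical

wav_to_xml = {}
for wav in wav_names:
    stem = wav.rsplit(".", 1)[0]
    # compare against the .xml stems and keep the single closest match, if any
    candidates = get_close_matches(stem, [x.rsplit(".", 1)[0] for x in xml_names], n=1, cutoff=0.6)
    # None means no caption was found; these pairs still need a manual recheck (step 4)
    wav_to_xml[wav] = candidates[0] + ".xml" if candidates else None

print(wav_to_xml)
```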
Ah no worries! Thanks for outlining the steps! I guess there's no need to have this automated; what you did is OK already. Perhaps we should document that process somewhere? Maybe as a docstring in the `alrose_url.py` file? Just a short description to ensure that future folks can trace where the URLs come from. :)
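For instance, something along these lines would already be enough (an illustrative sketch only; the exact wording is up to you):

```python
"""URL dictionary for the Alorese dataloader.

The URLs in `_URLS_DICT` were collected semi-automatically from the source archive:

1. Scrape the paginated listing to collect the blob URL of every detail page.
2. From each detail page, collect the `.wav` blob URLs.
3. Match each `.wav` with its `.xml` caption by picking the most similar file
   name from the caption blob listing.
4. Manually recheck the pairs, since naming is not always consistent and some
   `.wav` files have no `.xml` caption.

The scraping script itself is not included because it is slow (roughly 600 URLs)
and would break on minor UI changes of the archive.
"""
```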
With pleasure! Please check the newest commit 😄.
LGTM! Let's just wait for @sabilmakbar's review.
Init review
Wait, I'm going to check it quickly; pardon the late response.
BUILDER_CONFIGS = [
    SEACrowdConfig(
        name=f"{_DATASETNAME}_{subset}_source",
        version=datasets.Version(_SOURCE_VERSION),
        description=f"{_DATASETNAME} source schema for {subset} subset",
        schema="source",
        subset_id=f"{_DATASETNAME}_{subset}",
    )
    for subset in SUBSETS
] + [
    SEACrowdConfig(
        name=f"{_DATASETNAME}_t2t_seacrowd_t2t",
        version=datasets.Version(_SEACROWD_VERSION),
        description=f"{_DATASETNAME} SEACrowd schema for t2t subset",
        schema=f"seacrowd_t2t",
        subset_id=f"{_DATASETNAME}_t2t",
    ),
    SEACrowdConfig(
        name=f"{_DATASETNAME}_sptext_seacrowd_sptext",
        version=datasets.Version(_SEACROWD_VERSION),
        description=f"{_DATASETNAME} SEACrowd schema for sptext subset",
        schema=f"seacrowd_sptext",
        subset_id=f"{_DATASETNAME}_sptext",
    ),
    SEACrowdConfig(
        name=f"{_DATASETNAME}_sptext_trans_seacrowd_sptext",
        version=datasets.Version(_SEACROWD_VERSION),
        description=f"{_DATASETNAME} SEACrowd schema for sptext_trans subset",
        schema=f"seacrowd_sptext",
        subset_id=f"{_DATASETNAME}_sptext_trans",
    ),
]
May I ask a few questions here?
1. Overall, the schemas we use here are T2T for the translation between the Indonesian and Alorese transcriptions of the same audio, and SPText for the ASR version, right?
2. Which language is the audio content in, Indonesian or Alorese? When I looked at the code, there is no clear way to tell which language a transcription is in (and it's crucial to map the audio to the correct transcriptions).
3. For the config names `{_DATASETNAME}_sptext_seacrowd_sptext` and `{_DATASETNAME}_t2t_seacrowd_t2t`: may I know why the naming isn't something like `{_DATASETNAME}_seacrowd_sptext` and `{_DATASETNAME}_seacrowd_t2t`? Any justification here?
4. I saw that the `speaker_id` info went missing in the source. I thought the `source` config should include all columns and information (and stitch them together appropriately if the data is scattered across different files and configs, like what you did in your code).
Hi @sabilmakbar, thanks for the review and questions!
1. Yes.
2. The audio is in Alorese. The language of the transcription is denoted by either `sptext` for Alorese or `sptext_trans` for Indonesian. The mapping is done in this part.
3. Because of the subset naming, the testing code (and dataset naming format) that was constructed needs to follow `<DATASET_NAME>_<SUBSET_NAME>_seacrowd_<schema>`. It just happens that the subset name is the same as the schema name, so it might be confusing. Happy to change it if needed (see the example after this list).
4. Thanks for the input! Please review the latest commit, as I have addressed the nitpick.
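For illustration, under this naming scheme the configs would be loaded roughly like this (assuming `_DATASETNAME = "alorese"` and a local checkout of the dataloader script; the path and config names are illustrative, not definitive):

```python
from datasets import load_dataset

# the subset name happens to coincide with the schema name, hence the doubled suffix
t2t = load_dataset("seacrowd/sea_datasets/alorese/alorese.py", name="alorese_t2t_seacrowd_t2t")
asr = load_dataset("seacrowd/sea_datasets/alorese/alorese.py", name="alorese_sptext_seacrowd_sptext")
```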
Thanks for the prompt reply, @patrickamadeus!

> The audio is in Alorese. The language of the transcription is denoted by either `sptext` for Alorese or `sptext_trans` for Indonesian. The mapping is done in this part.

I don't know whether we need a schema of Alorese audio against Indonesian text for the SPText schema. My personal opinion is to remove it, since it could be very misleading (an ASR schema should provide the text that is the actual transcription as is, not a translation).

> Because of the subset naming, the testing code (and dataset naming format) that was constructed needs to follow `<DATASET_NAME>_<SUBSET_NAME>_seacrowd_<schema>`. It just happens that the subset name is the same as the schema name, so it might be confusing. Happy to change it if needed.

If I understand this correctly, the subsets for this dataset are only the Alorese version and the Indonesian version. SPText and T2T don't properly fit the definition of a "subset" of a dataset, since only the schema is different.
Ah I see, hmm, does it mean we should have separate dataloaders for the SPText and the T2T versions? I don't have a particularly strong opinion on either approach.
> Ah I see, hmm, does it mean we should have separate dataloaders for the SPText and the T2T versions?

No, we can have it in a single dataloader; I just think the subset and configuration naming should be modified slightly.

> The audio is in Alorese. The language of the transcription is denoted by either `sptext` for Alorese or `sptext_trans` for Indonesian.

The text information provided in this dataset is sequential; i.e., for every audio file, there is a sequence of annotated texts with their start & end timestamps.
For the Alorese text and audio, this can be put in the ASR schema (or even a sequential audio split against its annotated text, if we want to refine it further).
However, I don't think we should create an ASR schema for the Alorese audio and the translated Indonesian annotation, since the audio and text are in different languages.
And for T2T of Alorese and the Indonesian translation, the existing implementation is correct; we just need to reconstruct the configs list.
Therefore, my proposed configs are (a rough sketch follows the list):
- Source -- containing the Alorese audio, the Alorese annotation (and its timestamps), and the Indonesian annotation (and its timestamps too)
- SPText -- containing the Alorese audio & its annotation (the annotation could be combined into a single text per audio, as previously implemented, or recreated with a new sequenced text schema)
- T2T -- containing the Alorese annotation & the translated Indonesian annotation (we can leave this as full-text translation, not word-to-word or phrase-to-phrase translation)
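Something along these lines, reusing the existing `SEACrowdConfig` helper and the `_DATASETNAME` / version constants from the current dataloader (a rough sketch, not a final implementation):

```python
BUILDER_CONFIGS = [
    # source: Alorese audio + both annotation layers with their timestamps
    SEACrowdConfig(
        name=f"{_DATASETNAME}_source",
        version=datasets.Version(_SOURCE_VERSION),
        description=f"{_DATASETNAME} source schema",
        schema="source",
        subset_id=_DATASETNAME,
    ),
    # sptext: Alorese audio + Alorese annotation only (true ASR pairs)
    SEACrowdConfig(
        name=f"{_DATASETNAME}_seacrowd_sptext",
        version=datasets.Version(_SEACROWD_VERSION),
        description=f"{_DATASETNAME} SEACrowd sptext schema",
        schema="seacrowd_sptext",
        subset_id=_DATASETNAME,
    ),
    # t2t: Alorese annotation + translated Indonesian annotation
    SEACrowdConfig(
        name=f"{_DATASETNAME}_seacrowd_t2t",
        version=datasets.Version(_SEACROWD_VERSION),
        description=f"{_DATASETNAME} SEACrowd t2t schema",
        schema="seacrowd_t2t",
        subset_id=_DATASETNAME,
    ),
]
```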
I agree with @sabilmakbar's suggestion.
Btw, just confirming, do the audio recordings and the transcriptions match word-for-word, @patrickamadeus?
Hi! Thanks for the suggestions, @sabilmakbar!

> Btw, just confirming, do the audio recordings and the transcriptions match word-for-word?

Yes! There are multiple timestamps indicating when each word is spoken. @holylovenia, to be honest I haven't reviewed a substantial sample of the data to determine whether it matches word-for-word, but when I listened to the first 10 seconds of one particular example, it matched perfectly.
If there are no further suggestions or comments, I will implement the change by this weekend at the latest; I have a bunch of other stuff to do first.
Hi @sabilmakbar! Could you please check the latest commit? I have done the revision 👍.

> SPText -- containing the Alorese audio & its annotation (the annotation could be combined into a single text per audio, as previously implemented, or recreated with a new sequenced text schema)

For this one, I went with the previous implementation for now.
BUILDER_CONFIGS = [SEACrowdConfig(name=f"{_DATASETNAME}_source", version=datasets.Version(_SOURCE_VERSION), description=f"{_DATASETNAME} source schema", schema="source", subset_id=f"{_DATASETNAME}",)] + [
can we fix this formatting? :D
Bueno @sabilmakbar! It's done; sorry, I forgot to delete the old `]` bracket.
Hi @patrickamadeus, I have already put in an updated review. Let both of us know once the suggestions have been addressed; both LJ and I will probably need to re-run the whole check once more to make sure everything is correct, since this dataloader is quite complex. Thanks!
Co-authored-by: Salsabil Maulana Akbar <[email protected]>
Hi @patrickamadeus, all looks good to me. Since LJ said he doesn't have much PC storage left (presumably), I'll proceed with the merge :) (I was able to download all data & subsets and tested them too). How does that sound, @ljvmiranda921? If that's fine from your end, I'll approve and merge it.
^Yes, please feel free to merge! 🙇
fix formatting on `yield` of `_generate_examples`
lgtm!
Closes #448
Checkbox
- Create the dataloader script `seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py` (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its `__init__.py` within the `{my_dataset}` folder.
- Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_LOCAL`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- Implement `_info()`, `_split_generators()` and `_generate_examples()` in the dataloader script.
- Ensure the `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- Confirm the dataloader script works with the `datasets.load_dataset` function.
- Confirm the dataloader script passes the test suite run with `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py` or `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}`.

Subsets:
- T2T
- SPTEXT
- SPTEXT_TRANS